Data Cleaning*
Lecture 6
*for very clean data
Review:
Reading files
Data Cleaning
- Why does data have to be “cleaned”?
- Our goal: make sense of data
- Model relationships, find correlations, p-values
- Raw data isn’t always the best for this
Issues with Raw Data
- Missing data
- Categorical vs continuous variables
- Grouping data
We need to deal with these issues to do statistical modeling
and testing (will learn after midterm)
Data Cleaning:
Missing Values
Issues with Raw Data
- Missing data
- Example:
- Looking at how people’s vitamin D levels and height are
related
- If either of those values are missing, we want to exclude
them from our data
Missing values
- R uses NA to represent a missing value
- Reading in files: tell R what missing data looks like using
na.strings = “____”
- Ex: “” means that empty cells are NA (missing)
data <- read.csv(“file path here”, na.strings="")
- Get rid of rows where there is a NA
na.omit(data)
variable_name <- na.omit(data)
Full dataset
After na.omit
Missing values
- na.omit removes row if any of the values in that row are NA
- What if you didn’t care if certain columns had NA?
- Ex: height vs vitamin D, don’t care if sex is missing
variable_name <- data[complete.cases(data[ , c("VITAMIN_D", "HEIGHT")]), ]
- Replace data with the variable containing your data
- c(“column1”, “column2”, …) insert columns you can’t have NA in
- The comma at the end is not a mistake
Full dataset
After selective NA omit
Missing values Code
data <- read.csv("C:/Users/miche/Desktop/Missing Data
Example.csv", na.strings="")
all_NA_omitted_data <- na.omit(data)
omit_NA_if_unknown_vitD_height_data <-
data[complete.cases(data[ , c("VITAMIN_D", "HEIGHT")]), ]
Your Turn: heart_missing.csv
- heart_missing.csv contains “NA” if data is missing
- Import this dataset, omitting all participants for which
data is incomplete
Hint:
data <- read.csv("file path", na.strings= “__”)
all_NA_omitted_data <- na.omit(data)
data.csv Exercise
We could have combined the read() and na.omit() functions
Functions and missing values
Some functions have arguments
that allow you to specify what to do
with missing values
Data Manipulation:
Categorical Variables
Issues with Raw Data
- Continuous vs categorical variables
- Continuous: infinite values within a range
- Discrete/categorical: finite set of values
- Example:
- Height: continuous
- Sex (0= Female, 1=Male): categorical
Categorical Variables in R
- Numeric variables by default continuous
- Ex: BMI = 29.5, 39.3, 19.1211, …
- String variables by default categorical
- Ex: M, F, male, female, hypotensive, hypertensive,
- What if you wanted a numeric variable to be categorical?
- Ex: chromosome number = 1, 2, … 23
- Ex: sex = 0, 1
Revisiting heart_missing.csv
heart$chol:
cholesterol
levels
heart$sex:
0 = female
1 = male
Categorical Variables in R
factor()
- new_categorical_variable <- factor(original_variable)
- Returns a categorical version of a numeric variable
- We can store this categorical version in a new variable
Example:
Optional: we can add labels specify what 0 and 1 stand for, since
0 comes before 1, “Female” comes before “Male”
factor()
See the difference?
Your Turn: heart_missing.csv
- Smoking is stored as 0 = non-smoker, 1 = smoker
- Turn this into a categorical variable called
smoking_factor
Hint:
new_categorical_variable <- factor(original_variable)
heart_missing.csv smoking status
Optional: can add labels
Data Manipulation:
Grouping Data
Issues with Raw Data
- Want to separate data into groups
- Example:
- BMI according to
World Health Organization
- Perhaps we want to look
at cholesterol levels for people
in the obese category
Grouping data
- We already learned one way: conditionals!
ifelse(condition, what happens if true, what happens if false)
- We can nest these together if there are multiple conditions
Example:
factor(ifelse(heart$BMI < 18.5, "underweight",
ifelse(heart$BMI < 25, "normal",
ifelse(heart$BMI < 30, "pre-obese", "obese"))))
Grouping data
- Another way: boolean logic, 0 = false, 1 = true
- (1 == 1) + (3 < 4) + (4 >= 5) = 2
- True + true + false = 2
Example:
factor((heart$BMI>=18.5) + (heart$BMI>=25) + (heart$BMI>=30),
labels=c("Underweight","Normal","Pre-obese","Obese"))
Two methods to group BMI
(Alphabetical order)
(BMI order)
The order matters for modeling
But for this class, don’t worry about it,
as these both tell us the same
information
subset()
subset(variable with original dataset, conditional)
- We can get a subset (selective group) of our data using
subset()
Example
subset()
subset(variable with original dataset, conditional)
- We can get a subset (selective group) of our data using
subset()
Example
heart$chol vs BMI_obese_data$chol
Will produce different vectors
Another subset() example
Your Turn: heart_missing.csv
- Does age increase with cholesterol group?
- chol stores cholesterol levels
- Mayo clinic cholesterol levels classified according to table below
- Compare the ages of patients in cholesterol level group
Hint:
factor()
subset()
ifelse()
Heart.csv: cholesterol levels and age
Your Turn: heart_missing.csv
- Could cholesterol level be related to BMI?
- Compare summary of BMI between 3 cholesterol groups
Hint:
factor()
subset()
ifelse()
Heart_missing.csv: cholesterol levels and BMI
Your Turn: sleep
- Compare mean sleep hour difference between control and
experimental group. Does the experimental group have a
larger average increase in sleep than the control group?
Sleep: comparing means between groups
Logistics
Assignment due (see syllabus)
Exam review next week, midterm on Friday, 10/23
Take-home exam: 24 hours to complete (due 11:59 PM
on 10/23)
How to prepare?
Homeworks (do them! Check your solutions)
In-class exercises
List of topics on syllabus (from week 1, up to &
including this lecture)